1. Introduction


This tutorial provides an approach to basic analysis for single-cell RNA sequencing data:

  1. Data pre-processing
  • QC and selecting cells
  • Normalizing the data
  • Identifying high variable genes(feature selection)
  • Scaling the data
  1. Linear dimensional reduction
  • PCA
  • Determine dimensionality of PCs
  1. Cluster the cells

  2. Visualization

  3. Differentially expressed genes

  4. Identify the cell types

  5. Enrichment analysis

  • Gene Ontology
  • Pathway Enrichment

This tutorial refers to Seurat, a software(package) for analyzing single-cell RNA seq data in R.

Feel free to contact me if you have any questions. Email: Wechat: rick__sanchez__

2. Preparation


2.1 Software

All the anaylsis below relies on the R environment. Please download R and related software before the coures.

We select the tsinghua mirror to download R

  1. Download and install R
  • Windows user: Please click Download R for Windows, and then click install R for the first time.
  • Mac user: Please click Download R for (Mac) OS X, and then click R-3.6.1.pkg.
  1. Download and install Rstudio, and then select RStudio Desktop Open Source License FREE to download Windows Version RStudio 1.3.1056 - Windows 7+ (64-bit) or Mac Version RStudio 1.3.1056 - macOS 10.13+ (64-bit)

  2. Install the R packages for the following analysis.

    We also use the tsinghua mirror to download R packages for saving times and avoiding the unnecessary Error. Open the areadly installed Rstudio. Paste and execute the code elow in the Console panel:

2.2 Data

In this tutorial, we provide a dataset Peripheral Blood Mononuclear Cells(PBMC) freely available from 10X Genomics. There are ~5k single cells that were sequencing on the Illumina NextSeq500. You can download the data from the 10X Genomics.

  1. Open 10X official website for downloading data.
  2. Select section DATASETS and click View all.
  3. When you enter this page, you should fill in the relevant information: Name, Your Email, Institution, Country. Click no for “receive update” and agree the “Privacy Policy”.
  4. Select 5k Peripheral blood mononuclear cells (PBMCs) from a healthy donor (v3 chemistry). And then select Feature / cell matrix (filtered). Warning: This Object size is 41.21 MB.

And then you should extract the compressed file into a directory, such as D:/Tutorial/scRNA_seq_tutorial/counts/filtered_feature_bc_matrix(this directory need to be created).

Then, we can check how the 10X single-cell RNA seq data is stored. we open the directoryD:/Tutorial/scRNA_seq_tutorial/counts/filtered_feature_bc_matrix, and we will find that three compressed files store their respective files:

  • barcodes.tsv.gz is used to store the information of each cell barcode. In general, it store the name of each single-cell.
  • features.tsv.gz is used to store the genetic information obtained by sequencing.
  • matrix.mtx.gz is used to store the data of single-cell expressed genes, and it is a storage format for sparse matrices using the Coordinate Format(COO If you are interested in this format, click it.)

3. Start


3.2 Setup the Seurat Object

At first, We have to set a correct working directory.

Then we can load the data by calling Read10X(). This function can read the three files(barcodes.tsv.gz, features.tsv.gz, matrix.mtx.gz) in the directory at once.

Then we can setup the seurat object by calling CreateSeuratObject().

We highly encourage querying the detail of a function by calling help(function_name) or?function_name, such as help(CreateSeuratObject).

## An object of class Seurat 
## 19037 features across 5100 samples within 1 assay 
## Active assay: RNA (19037 features)

4. Data pre-processing


These workflow represent the selection and filtration of cells based on QC metrics, data normalization and data scaling, and the detection of highly variable genes.

4.1 QC and selecting cells

Seurat provides simple QC metrics and allow you to filter cells based on user-defined criterion. A few QC metrics include commonly:

  • The number of unique genes detected in each cell(each droplet):
    • Low-quality cells or empty droplets will often have very few gene
    • Cell doublets or multiplets may have an aberrantly high gene count
  • The above assumptions also apply to the total number of molecules in each cell(each droplet).
  • The percentage of reads that map to the mitochondrial genome
    • Low-quality or dying cells often exhibit extensive mitochondrial contamination

we call PercentageFeatureSet to caculate the percentage of mitochondrial genes. Mitochondrial genes are a set genes which can be captured by matching the prefix MT-.

##                   orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACCCAAGACAGCTG sc_tutorial       8384         2550   9.148378
## AAACCCAAGTTAACGA sc_tutorial      11768         3017   6.296737
## AAACCCACAGTCGCAC sc_tutorial       6947         2033   5.570750
## AAACCCAGTAGCTTGT sc_tutorial       9354         2193   7.750695
## AAACCCATCTGCCCTA sc_tutorial       8970         2630   7.781494
## AAACGAAAGCAATTAG sc_tutorial       9291         2286  11.634916

In this data, we visualize the QC metric and use these filter cells.

nCount_RNA means that the total number of reads per cells map to the genome. nFeature_RNA means that the total number of genes expressed per cell. percent.mt means that the percentage of mitochondrial genes per cell.

  • we filter cells that have nFeature_RNA over 5000
  • we filter cells that have nFeature_RNA less than 200
  • we filter cells that have percent.mt over 20%

We can find that samples is 4591 now, before QC samples is 5100.

## An object of class Seurat 
## 19037 features across 4591 samples within 1 assay 
## Active assay: RNA (19037 features)

4.2 Normalizing the data

After moving unwanted cells, the next step is to normalize the data. We employ a global-scaling normalizaiton method LogNormalize that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a size factor(10,000 by default), and log-transforms the result. Normalized values are stored in the adata[["RNA"]]@data which is another sparse matrix.

4.3 Identifying high variable genes

We next caculate a subset of geens that exhibit high cell-to-cell variation in the data(they are highly expressed in some cells,and lowly expressed in others). The high varialble genes can highlight biological signal in single-cell data. we can find high variable genes(HVGs) by calling FindVariableFeature() function. By default, we return top 3000 HVGs in single-cell data. These genes will be used in downstream analysis, like PCA. The resluts are stored in adata[["RNA"]]@var.features.

4.4 Scaling the data

Next, we apply a linear transformation(“scaling”) that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The scaleData function:

  • Shift the expression of each gene, so that the mean expressiong across cells is 0
  • Scales the expression of each gene, so that the variance across cells is 1 The results are stored in the adata[["RNA"]]@scale.data
  • why? This step is to prevent some values that are too large to capture biological information stably. But there are also agruments that is better reuslt without scale.

Brief summary

In the data pre-processing workflow, we do

  1. QC and selecting cells This step is to remove th low-quality or outlier cells.
  2. Normalizing the data This step is to smooth gene expression, so that the highly-expressed genes are not “so high”
  3. Identifying high variable genes(feature selection) This step is to highlight biological signal in single-cell data
  4. Scaling the data This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate

5. Linear dimensional reduction


5.1 PCA

Next, we perform PCA on the scale.data. By default, only the previously detemined variable features are used as input. we select the top 3000 HVGs as input.

5.2 Determine dimensionality of PCs

PCA is feature selection techniques whose purpose is to reduce noise, extract features, data compression, increase computing speed and so on. Each PC is linear combination that combines information across a correlated gene set. The top PCs therefore represent a robust comprassion of the dataset. How many components should we choose to include? 10? 20? 100? We can call function ElblowPlot to generates a elbow plot: a ranking of the PCs based on the percentages of variance. In this dataset, we can observe an “elbow” around PC20, suggesting that the majority features are captured in the first 20 PCs.

6. Clustering


Seurat applies a graph-based clustering approach. 1. They first construct a NN graph based on the Euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods. Call FindNeighbors() to construct a NN graph. 2. To cluster the cells, they apply the modularity optimization algorithm ( Louvain algorithm, default) to cluster the cells. We can call FindClusters() to do this step. And a parameter resolution is set in the FindClusters(), with increased values leading to a greater number of clusters.

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 4591
## Number of edges: 162469
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9151
## Number of communities: 13
## Elapsed time: 0 seconds
##                   orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACCCAAGACAGCTG sc_tutorial       8384         2550   9.148378
## AAACCCAAGTTAACGA sc_tutorial      11768         3017   6.296737
## AAACCCACAGTCGCAC sc_tutorial       6947         2033   5.570750
## AAACCCAGTAGCTTGT sc_tutorial       9354         2193   7.750695
## AAACCCATCTGCCCTA sc_tutorial       8970         2630   7.781494
## AAACGAAAGCAATTAG sc_tutorial       9291         2286  11.634916
##                  RNA_snn_res.0.4 seurat_clusters
## AAACCCAAGACAGCTG               2               2
## AAACCCAAGTTAACGA               1               1
## AAACCCACAGTCGCAC               0               0
## AAACCCAGTAGCTTGT               8               8
## AAACCCATCTGCCCTA               6               6
## AAACGAAAGCAATTAG               6               6

7. Visualization


Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP, to visualize and explore these datasets. The goal of these algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space.

call function RunUMAP

If you haven’t installed UMAP, you can do so via reticulate::py_install(packages = 'umap-learn')

call function RunTSNE

9. Identify the cell types


In this PBMC data, we can use the canonical markers to identify cell types and to mach the cell clusters.

Cell Clusters Gene Markers Cell Types
5,6 MS4A1 B
4 GNLY, NKG7 NK
1 CD14, LYZ CD14+ Mono
0,8,9 IL7R, CCR7 Naive CD4+T
10,12 FCER1A, CST3 DC
7 MS4A7, FCGR3A FCGR3A+ Mono
3 IL7R, S100A4 Memory CD4+
2 CD8A CD8+ T
11 PPBP Platelet

10. plotting the cell types


So we can plot the cell types labels on the each cluster

11. Enrichment Analysis


We have to transform the gene SYMBOL to ENTREZID before enrichment

GO enrichment